Skip to content

feat(billing): distributed lock for multi-pod crons#3688

Merged
PierreBrisorgueil merged 8 commits into
masterfrom
feat/billing-cron-distributed-lock
May 22, 2026
Merged

feat(billing): distributed lock for multi-pod crons#3688
PierreBrisorgueil merged 8 commits into
masterfrom
feat/billing-cron-distributed-lock

Conversation

@PierreBrisorgueil
Copy link
Copy Markdown
Contributor

@PierreBrisorgueil PierreBrisorgueil commented May 21, 2026

Summary

  • Adds lib/distributedLock.js — Mongo TTL-doc lock primitive (acquireLock / releaseLock, holder-bound release, app-level expiry check decoupled from Mongo's 60s TTL monitor)
  • Integrates lock in 3 billing crons (weeklyReset / dunningSweep / extrasExpiration) with skip-on-contention behavior
  • Bonus fix : restore (errors > 0 || desyncErrors > 0) condition in dunningSweep exit code (pre-existing regression — K8s alerts were silent on Org.plan desync failures)
  • 8 unit tests + 6 integration tests + RUNBOOK entry

Context — audit P1 (2026-05-21)

Devkit audit identified billing crons race risk when K8s concurrencyPolicy: Forbid is bypassed (e.g. pod crash post-jitter pre-finalize). Distributed lock provides defense-in-depth.

Reviews completed before push :

  • In-session code-quality review (2 Important + 4 Minor) → fixed in commit d3264b8b
  • DeepSeek pre-push gate iter 1 BLOCK → 1 HIGH (dunningSweep exitCode) + 1 MEDIUM (releaseLock error masking) + 1 LOW + 1 NIT → fixed in commit b971b2cc
  • DeepSeek pre-push gate iter 2 OK with nits → 1 low + 2 nits remaining, deemed cosmetic

Doc-sync ack

  • RUNBOOKS.md numbered section is ## 6 — Cron lock stuck ; OK since master section count was 5
  • README.md updated with ## Concurrency control section ; OK

Test plan

  • NODE_ENV=test npm run test:unit — 1493 tests pass
  • NODE_ENV=test npm run test:integration -- billing.cron-lock — 6 tests pass
  • NODE_ENV=devkit npm run lint — clean
  • Pre-existing \$sortArray integration failures unrelated (also failing on master)
  • CI green
  • CodeRabbit clean

Plan

docs/superpowers/plans/2026-05-21-devkit-audit-p0-p1-fixes.md Phase 1.

Summary by CodeRabbit

  • New Features

    • Added distributed lock system to billing cron jobs, preventing duplicate execution across multiple pods.
    • Implemented automatic lock expiration with fallback TTL to handle stale locks.
  • Documentation

    • Added operational runbook for diagnosing and resolving stuck cron lock scenarios.
    • Added concurrency control documentation describing lock behavior and expected log messages.
  • Tests

    • Added unit tests for lock acquisition and release mechanisms.
    • Added integration tests verifying lock behavior under concurrent access.

Review Change Stack

Mongo-backed TTL doc lock with auto-expiry. Acquired via findOneAndUpdate
upsert with stale-or-absent predicate; released by holder match.

Addresses devkit audit P1 (2026-05-21) — billing crons race when K8s
concurrencyPolicy bypass occurs (e.g. crash post-jitter pre-finalize).
weeklyReset, dunningSweep, extrasExpiration now acquire a distributed
lock before mutating subscriptions. Skip-on-contention with info log;
release in finally for clean recovery.

Also fixes Mongoose 8 deprecation: `new: true` → `returnDocument: 'after'`
in findOneAndUpdate.

Closes audit P1 — multi-pod race on crash post-jitter pre-finalize.
- Move distributedLock unit test to canonical lib location (or document deviation)
- Add releaseLock throw-path test + non-fatal comment in 3 crons' finally
- Harmonize log style in skip-on-contention path
- Rename + document integration test (in-process concurrency clarification)
- Add findOne verification step in RUNBOOKS before deleteOne

Addresses code-quality review I1, I2, M1, M3, M4.
- HIGH: restore (errors || desyncErrors) condition in dunningSweep exit code
  (regression introduced when wrapping with lock — would make K8s alerts silent)
- MEDIUM: wrap releaseLock in try/catch within finally to preserve original
  work error if both throw
- LOW: guard acquireLock against invalid ttlMs (must be positive finite)
- NIT: drop redundant setDefaultsOnInsert (all fields explicitly $set)

Addresses gate iteration 1 BLOCK.
Copilot AI review requested due to automatic review settings May 21, 2026 19:57
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

Warning

Rate limit exceeded

@PierreBrisorgueil has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 17 minutes and 48 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: b5b718ec-1b99-405e-880d-d6448fd53aad

📥 Commits

Reviewing files that changed from the base of the PR and between b971b2c and b232b68.

📒 Files selected for processing (7)
  • lib/services/distributedLock.js
  • lib/services/tests/distributedLock.unit.tests.js
  • modules/billing/crons/README.md
  • modules/billing/crons/billing.dunningSweep.js
  • modules/billing/crons/billing.extrasExpiration.js
  • modules/billing/crons/billing.weeklyReset.js
  • modules/billing/tests/billing.cron-lock.integration.tests.js

Walkthrough

This PR introduces MongoDB-backed distributed locking to prevent concurrent execution of billing cron jobs across multiple Kubernetes pods. Three cron scripts now acquire a lock before executing, skip work if the lock is unavailable, and release the lock via TTL expiry or explicit deletion.

Changes

Distributed lock for billing cron concurrency

Layer / File(s) Summary
Distributed lock schema and operations
lib/distributedLock.js, lib/tests/distributedLock.unit.tests.js
Mongoose cron_locks collection with TTL index; acquireLock performs atomic upsert-only-if-missing/expired with duplicate-key detection; releaseLock deletes only when holder matches. Unit tests mock MongoDB and verify success, holder mismatch, error handling, validation, and timestamp windows.
Distributed lock integration tests
modules/billing/tests/billing.cron-lock.integration.tests.js
Real MongoDB tests verify acquisition on empty state, rejection when held by another pod, successful acquisition after TTL expiry, holder-matching deletion, single-winner concurrency under contention, and lockedUntil correctness.
Billing cron lock integration
modules/billing/crons/billing.weeklyReset.js, modules/billing/crons/billing.dunningSweep.js, modules/billing/crons/billing.extrasExpiration.js
Each cron imports randomUUID and lock utilities, generates a unique holder ID, acquires lock before work, exits(0) if unavailable, executes core sweep logic only when locked, sets exit code from error count, and releases lock in finally with non-fatal error logging.
Lock operational documentation and runbooks
modules/billing/RUNBOOKS.md, modules/billing/crons/README.md
README documents lock purpose, auto-expiry, per-cron lock names with 5–10 minute TTLs, and expected skip messages. Runbook #6 covers stuck-lock symptoms, TTL-wait guidance, inspection commands, manual deletion procedures, and prevention advice.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: adding distributed lock support to multi-pod billing cron jobs, which is the core objective of the entire changeset.
Description check ✅ Passed The description comprehensively covers summary, context, scope, validation, test results, and reviewers' notes. It follows the template structure with all critical sections completed and provides detailed context including audit findings, review iterations, and specific test results.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/billing-cron-distributed-lock

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.66%. Comparing base (da02947) to head (b232b68).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3688      +/-   ##
==========================================
+ Coverage   89.62%   89.66%   +0.03%     
==========================================
  Files         139      140       +1     
  Lines        4744     4759      +15     
  Branches     1488     1491       +3     
==========================================
+ Hits         4252     4267      +15     
  Misses        385      385              
  Partials      107      107              
Flag Coverage Δ
integration 59.71% <86.66%> (+0.08%) ⬆️
unit 66.00% <100.00%> (+0.10%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update da02947...b232b68. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a MongoDB-backed distributed lock primitive to prevent multi-pod billing CronJobs from running concurrently, integrates it into the key billing crons, and adds documentation + tests to validate the locking contract end-to-end.

Changes:

  • Add lib/distributedLock.js (Mongo TTL-doc based lock) with unit + integration tests.
  • Wrap weeklyReset, dunningSweep, and extrasExpiration billing crons with acquire/skip/release lock behavior.
  • Document operational handling for stuck locks in the billing runbooks and cron README.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
modules/billing/tests/billing.cron-lock.integration.tests.js New integration coverage for lock acquire/release/expiry behavior against real MongoDB.
modules/billing/RUNBOOKS.md Adds a runbook entry for diagnosing and clearing stuck cron locks.
modules/billing/crons/README.md Documents cron lock purpose, TTLs, and expected “skipping” logs.
modules/billing/crons/billing.weeklyReset.js Adds distributed lock acquisition/release around weekly reset execution.
modules/billing/crons/billing.extrasExpiration.js Adds distributed lock acquisition/release around extras expiration sweep.
modules/billing/crons/billing.dunningSweep.js Adds distributed lock acquisition/release around dunning sweep and restores exit-code behavior on desync errors.
lib/tests/distributedLock.unit.tests.js New unit tests with mocked Mongoose for lock behavior and error handling.
lib/distributedLock.js New lock primitive (CronLock model + acquireLock/releaseLock).

Comment thread modules/billing/crons/billing.dunningSweep.js
Comment thread modules/billing/crons/billing.extrasExpiration.js
Comment thread modules/billing/crons/billing.weeklyReset.js Outdated
Comment thread modules/billing/crons/billing.extrasExpiration.js Outdated
Comment thread modules/billing/crons/billing.dunningSweep.js Outdated
Comment thread modules/billing/tests/billing.cron-lock.integration.tests.js Outdated
coderabbitai[bot]
coderabbitai Bot previously requested changes May 21, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@lib/distributedLock.js`:
- Around line 52-69: The current acquireLock in distributedLock.js only sets
lockedUntil once (in acquireLock) which allows the lease to expire during long
runs; add a renewal/heartbeat mechanism and a safe renew API: implement a
renewLock({name, holder, ttlMs}) that performs a conditional findOneAndUpdate
matching {_id: name, holder} and sets lockedUntil to now+ttlMs (return true only
if holder matches), call that renew periodically from the long-running worker
while it holds the lock (e.g., setInterval cleared on finish) and ensure
renewLock validates ttlMs like acquireLock; alternatively implement a fencing
pattern by embedding a monotonic epoch/token (holder already used) and always
require the holder match when extending or releasing so a stale process cannot
steal the lock.
- Around line 19-35: The new shared distributed lock (LockSchema and exported
CronLock) must live under the approved shared code locations; either move this
module into lib/helpers/ or lib/services/ and update all imports that reference
CronLock to the new path, or if there is a deliberate exception, add a
top-of-file comment justifying why it must remain at lib/distributedLock.js
(include owner, rationale, and link to the exemption approval) and run a
repo-wide search to update/verify imports; ensure the exported symbol CronLock
and the schema name LockSchema remain unchanged so dependent modules still
resolve correctly.

In `@modules/billing/crons/billing.dunningSweep.js`:
- Around line 119-122: The logger.error call in the dunningSweep lock-release
failure currently passes metadata first and message second; change it to pass
the message string first and the metadata object second (i.e., flip the
arguments for logger.error) so it matches the rest of the file and other billing
crons—update the call that references LOCK_NAME and releaseErr in the
dunningSweep lock-release error handling to use logger.error(message, { err:
releaseErr, cron: LOCK_NAME }).

In `@modules/billing/crons/billing.extrasExpiration.js`:
- Around line 89-92: The logger.error call that currently passes meta first
should be changed to message-first ordering to match other calls: call
logger.error with the string '[cron.extrasExpiration] failed to release lock —
will auto-expire on TTL' as the first argument and the metadata object { err:
releaseErr, cron: LOCK_NAME } as the second; update the logger.error invocation
that references releaseErr and LOCK_NAME in the extrasExpiration cron to use
this ordering so breadcrumb format is consistent.

In `@modules/billing/crons/README.md`:
- Around line 136-139: Update the runbook section reference in the README:
replace the text `## Cron lock stuck` with the exact header `## 6 — Cron lock
stuck` so the link/mention matches the actual section title in the billing
runbook; edit the sentence in modules/billing/crons/README.md that currently
reads "See the runbook entry `## Cron lock stuck` in
`modules/billing/RUNBOOKS.md` for manual resolution." to use the exact header
`## 6 — Cron lock stuck`.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 845b6d9b-4d37-45e6-a5d9-3131a62e9fa2

📥 Commits

Reviewing files that changed from the base of the PR and between da02947 and b971b2c.

📒 Files selected for processing (8)
  • lib/distributedLock.js
  • lib/tests/distributedLock.unit.tests.js
  • modules/billing/RUNBOOKS.md
  • modules/billing/crons/README.md
  • modules/billing/crons/billing.dunningSweep.js
  • modules/billing/crons/billing.extrasExpiration.js
  • modules/billing/crons/billing.weeklyReset.js
  • modules/billing/tests/billing.cron-lock.integration.tests.js

Comment thread lib/services/distributedLock.js
Comment thread lib/services/distributedLock.js
Comment thread modules/billing/crons/billing.dunningSweep.js Outdated
Comment thread modules/billing/crons/billing.extrasExpiration.js Outdated
Comment thread modules/billing/crons/README.md
- Move lib/distributedLock.js → lib/services/distributedLock.js
  (aligns with existing convention for service-level primitives)
- Move lib/tests/distributedLock.unit.tests.js → lib/services/tests/
- Update import paths in 3 billing crons + integration tests
- Flip logger.error arg order in dunningSweep + extrasExpiration finally
  (match file convention: (message, meta) not (meta, message))
- Same fix applied to weeklyReset (same pattern found, not flagged by CR)
- Update modules/billing/crons/README.md → reference '## 6 — Cron lock stuck'
  (exact match with RUNBOOKS section header)

Heartbeat/renewLock suggestion intentionally NOT added — see PR reply.
@PierreBrisorgueil
Copy link
Copy Markdown
Contributor Author

@coderabbitai Thanks for the thorough review. Quick context on the 5 findings:

#2-#5 addressed in commit 50a7a80a :

  • moved distributedLock.js to lib/services/ (aligns with existing primitives like analytics.featureFlags.js)
  • updated import paths in all 3 crons + integration test
  • flipped logger.error arg order to (message, meta) in dunningSweep + extrasExpiration finally blocks — also applied the same fix to weeklyReset which had the same pattern (CR missed it)
  • updated README cron lock reference to exact ## 6 — Cron lock stuck header

#1 (renewLock heartbeat) — intentionally not added. Rationale:

  • TTLs are sized 2-3x typical exec time per RUNBOOKS section 6 (10min weekly vs <2min typical, 15min dunning vs <5min, 5min extras vs <1min)
  • If a cron runs past TTL, the lock auto-expiring is the intended recovery path adding a heartbeat would mask slow queries that need investigation, not protection
  • YAGNI: no concrete operational scenario currently requires renewal. Easy to add later if/when a long-running billing op emerges.

Happy to revisit if you have a specific scenario in mind that the current TTL margin does not cover.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

Tip

For best results, initiate chat on the files or code changes.

@PierreBrisorgueil: Thanks for the clear and complete write-up.

#2#5 — All four look good. Appreciate the proactive weeklyReset logger fix too — that was indeed a missed instance in the original review.

#1 (renewLock heartbeat) — accepted. The rationale holds up well:

  • A 2–3× TTL margin over typical exec time is a sound operational buffer.
  • Treating TTL auto-expiry as the recovery path rather than a failure mode is a clean design for short-lived crons — a heartbeat would genuinely obscure runaway queries that should be investigated.
  • The RUNBOOK entry already covers the "lock stuck" case for operators.

The design is intentional and documented, which is exactly the right place to leave it. No further action needed on #1.

🐇 ✨


✏️ Learnings added
Learnt from: PierreBrisorgueil
URL: https://github.com/pierreb-devkit/Node/pull/3688

Timestamp: 2026-05-21T20:12:01.289Z
Learning: In the billing distributed lock design (`lib/services/distributedLock.js`), a `renewLock` heartbeat was intentionally omitted. TTLs are sized 2–3× typical exec time (10 min weekly reset vs <2 min typical; 15 min dunning sweep vs <5 min; 5 min extras expiration vs <1 min). TTL auto-expiry is the *intended* recovery path — adding a heartbeat would mask slow queries that need investigation. RUNBOOK section 6 ("Cron lock stuck") documents the manual intervention procedure.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

@PierreBrisorgueil PierreBrisorgueil dismissed coderabbitai[bot]’s stale review May 21, 2026 20:21

CR replied accepting all 5 findings (4 fixed in 50a7a80, #1 rationale accepted via PR comment). Dismissing stale CHANGES_REQUESTED state per project convention.

… clarify lock test name

Resolves Copilot MissingSchemaError finding: dunningSweep and extrasExpiration
called mongooseService.connect() without loadModels(), risking MissingSchemaError
on repository top-level mongoose.model() calls. Matches weeklyReset/reconcile pattern.

Also renames test 'ttlMs < 0' → 'existing lock has expired (lockedUntil < now)'
to accurately describe the scenario being exercised.
@PierreBrisorgueil PierreBrisorgueil merged commit d397764 into master May 22, 2026
8 checks passed
@PierreBrisorgueil PierreBrisorgueil deleted the feat/billing-cron-distributed-lock branch May 22, 2026 09:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants